photorealistic image
- Europe > Germany > Hesse > Darmstadt Region > Darmstadt (0.04)
- North America > United States > Texas (0.04)
- Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Sensing and Signal Processing > Image Processing (0.96)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.50)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)
An indicator for effectiveness of text-to-image guardrails utilizing the Single-Turn Crescendo Attack (STCA)
Kwartler, Ted, Bagan, Nataliia, Banny, Ivan, Aqrawi, Alan, Abbasi, Arian
The Single-Turn Crescendo Attack (STCA), first introduced in Aqrawi and Abbasi [2024], is an innovative method designed to bypass the ethical safeguards of text-to-text AI models, compelling them to generate harmful content. This technique leverages a strategic escalation of context within a single prompt, combined with trust-building mechanisms, to subtly deceive the model into producing unintended outputs. Extending the application of STCA to text-to-image models, we demonstrate its efficacy by compromising the guardrails of a widely used model, DALL-E 3, achieving outputs comparable to those from the uncensored model Flux Schnell, which served as a baseline control. This study provides a framework for researchers to rigorously evaluate the robustness of guardrails in text-to-image models and benchmark their resilience against adversarial attacks.
- Research Report > Promising Solution (0.34)
- Research Report > Experimental Study (0.34)
- Health & Medicine (0.95)
- Information Technology > Security & Privacy (0.88)
- Law (0.69)
Relations, Negations, and Numbers: Looking for Logic in Generative Text-to-Image Models
Conwell, Colin, Tawiah-Quashie, Rupert, Ullman, Tomer
Despite remarkable progress in multi-modal AI research, there is a salient domain in which modern AI continues to lag considerably behind even human children: the reliable deployment of logical operators. Here, we examine three forms of logical operators: relations, negations, and discrete numbers. We asked human respondents (N=178 in total) to evaluate images generated by a state-of-the-art image-generating AI (DALL-E 3) prompted with these 'logical probes', and find that none reliably produce human agreement scores greater than 50%. The negation probes and numbers (beyond 3) fail most frequently. In a 4th experiment, we assess a 'grounded diffusion' pipeline that leverages targeted prompt engineering and structured intermediate representations for greater compositional control, but find its performance is judged even worse than that of DALL-E 3 across prompts. To provide further clarity on potential sources of success and failure in these text-to-image systems, we supplement our 4 core experiments with multiple auxiliary analyses and schematic diagrams, directly quantifying, for example, the relationship between the N-gram frequency of relational prompts and the average match to generated images; the success rates for 3 different prompt modification strategies in the rendering of negation prompts; and the scalar variability / ratio dependence ('approximate numeracy') of prompts involving integers. We conclude by discussing the limitations inherent to 'grounded' multimodal learning systems whose grounding relies heavily on vector-based semantics (e.g. DALL-E 3), or under-specified syntactical constraints (e.g. 'grounded diffusion'), and propose minimal modifications (inspired by development, based in imagery) that could help to bridge the lingering compositional gap between scale and structure. All data and code are available at https://github.com/ColinConwell/T2I-Probology
- Europe > Switzerland > Zürich > Zürich (0.14)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Europe > Latvia > Lubāna Municipality > Lubāna (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Research Report > Experimental Study (0.93)
- Research Report > New Finding (0.93)
DART: An Automated End-to-End Object Detection Pipeline with Data Diversification, Open-Vocabulary Bounding Box Annotation, Pseudo-Label Review, and Model Training
Xin, Chen, Hartel, Andreas, Kasneci, Enkelejda
Swift and accurate detection of specified objects is crucial for many industrial applications, such as safety monitoring on construction sites. However, traditional approaches rely heavily on arduous manual annotation and data collection, which struggle to adapt to ever-changing environments and novel target objects. To address these limitations, this paper presents DART, an automated end-to-end pipeline designed to streamline the entire workflow of an object detection application from data collection to model deployment. DART eliminates the need for human labeling and extensive data collection while excelling in diverse scenarios. It employs a subject-driven image generation module (DreamBooth with SDXL) for data diversification, followed by an annotation stage where open-vocabulary object detection (Grounding DINO) generates bounding box annotations for both generated and original images. These pseudo-labels are then reviewed by a large multimodal model (GPT-4o) to guarantee credibility before serving as ground truth to train real-time object detectors (YOLO). We apply DART to a self-collected dataset of construction machines named Liebherr Product, which contains over 15K high-quality images across 23 categories. The current implementation of DART significantly increases average precision (AP) from 0.064 to 0.832. Furthermore, we adopt a modular design for DART to ensure easy exchangeability and extensibility. This allows for a smooth transition to more advanced algorithms in the future, seamless integration of new object categories without manual labeling, and adaptability to customized environments without extra data collection. The code and dataset are released at https://github.com/chen-xin-94/DART.
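To make the flow of the four stages easier to follow, here is a minimal structural sketch of the pipeline as described in the abstract; every helper function name below is a hypothetical placeholder rather than the authors' API, and the actual implementation lives in the linked repository.

```python
# Structural sketch only: the four DART stages chained end to end.
# Each helper is a hypothetical placeholder standing in for the corresponding
# model (DreamBooth + SDXL, Grounding DINO, GPT-4o, YOLO); see
# https://github.com/chen-xin-94/DART for the real code.

def dart_pipeline(original_images, class_names):
    # 1. Data diversification: subject-driven image generation.
    generated_images = diversify_with_dreambooth_sdxl(original_images, class_names)

    # 2. Open-vocabulary annotation: propose bounding boxes for both
    #    the original and the generated images.
    pseudo_labels = annotate_with_grounding_dino(
        original_images + generated_images, class_names
    )

    # 3. Pseudo-label review: a large multimodal model accepts or rejects
    #    each box before it is promoted to ground truth.
    vetted_labels = review_with_multimodal_model(pseudo_labels)

    # 4. Model training: fit a real-time detector on the vetted labels.
    return train_yolo_detector(vetted_labels)
```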
- Europe > Germany > North Rhine-Westphalia > Upper Bavaria > Munich (0.04)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- Materials > Metals & Mining (1.00)
- Machinery > Construction Machinery & Heavy Trucks (1.00)
- Energy (1.00)
- Construction & Engineering (1.00)
Semantic Augmentation in Images using Language
Yerramilli, Sahiti, Tamarapalli, Jayant Sravan, Kulkarni, Tanmay Girish, Francis, Jonathan, Nyberg, Eric
Deep Learning models are incredibly data-hungry and require very large labeled datasets for supervised learning. As a consequence, these models often suffer from overfitting, limiting their ability to generalize to real-world examples. Recent advancements in diffusion models have enabled the generation of photorealistic images based on textual inputs. Leveraging the substantial datasets used to train these diffusion models, we propose a technique to utilize generated images to augment existing datasets. This paper explores various strategies for effective data augmentation to improve the out-of-domain generalization capabilities of deep learning models.
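As a rough illustration of the general idea (not the paper's specific method), the sketch below uses an off-the-shelf text-to-image diffusion model from the `diffusers` library to synthesize extra training images for a labeled dataset; the checkpoint name, class labels, and prompt template are all illustrative assumptions.

```python
# Minimal sketch: generate synthetic training images per class with an
# off-the-shelf diffusion model, then fold them into the labeled dataset.
# Checkpoint, classes, and prompts are illustrative, not from the paper.
from pathlib import Path

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

class_names = ["golden retriever", "red sports car"]  # hypothetical classes
images_per_class = 4
out_dir = Path("augmented")
out_dir.mkdir(exist_ok=True)

for label in class_names:
    for i in range(images_per_class):
        # Vary the described context to push samples away from the source domain.
        prompt = f"a photorealistic photo of a {label} in heavy rain at dusk"
        image = pipe(prompt).images[0]
        image.save(out_dir / f"{label.replace(' ', '_')}_{i}.png")
```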
MultiFusion: Fusing Pre-Trained Models for Multi-Lingual, Multi-Modal Image Generation
Bellagente, Marco, Brack, Manuel, Teufel, Hannah, Friedrich, Felix, Deiseroth, Björn, Eichenberg, Constantin, Dai, Andrew, Baldock, Robert, Nanda, Souradeep, Oostermeijer, Koen, Cruz-Salinas, Andres Felipe, Schramowski, Patrick, Kersting, Kristian, Weinbach, Samuel
The recent popularity of text-to-image diffusion models (DMs) can largely be attributed to the intuitive interface they provide to users. The intended generation can be expressed in natural language, with the model producing faithful interpretations of text prompts. However, expressing complex or nuanced ideas in text alone can be difficult. To ease image generation, we propose MultiFusion, which allows one to express complex and nuanced concepts with arbitrarily interleaved inputs of multiple modalities and languages. MultiFusion leverages pre-trained models and aligns them for integration into a cohesive system, thereby avoiding the need for extensive training from scratch. Our experimental results demonstrate the efficient transfer of capabilities from individual modules to the downstream model. Specifically, the fusion of all independent components allows the image generation module to utilize multilingual, interleaved multimodal inputs despite being trained solely on monomodal data in a single language.
- Europe > Germany > Hesse > Darmstadt Region > Darmstadt (0.04)
- North America > United States > Texas (0.04)
- Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Google's AI-powered search tool can help tackle your holiday shopping
Google is scaling up Search Generative Experience (SGE) for holiday shopping. The company announced Thursday that its AI-powered search bot can now spit out gift ideas, photorealistic images of product types and virtual try-ons of men's tops. Google SGE launched in May, offering AI-driven answers and suggestions to complement the search engine's standard web results. The company has since added follow-up queries, better translations and interactive definitions in more complex subjects. The tool requires Chrome on desktop or the Google mobile app on smartphones.
- Retail (0.62)
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.62)
- Consumer Products & Services (0.62)
AI imager Midjourney v5 stuns with photorealistic images--and 5-fingered hands
On Wednesday, Midjourney announced version 5 of its commercial AI image synthesis service, which can produce photorealistic images at a quality level that some AI art fans are calling creepy and "too perfect." Midjourney v5 is available now as an alpha test for customers who subscribe to the Midjourney service, which is available through Discord. "MJ v5 currently feels to me like finally getting glasses after ignoring bad eyesight for a little bit too long," said Julie Wieland, a graphic designer who often shares her Midjourney creations on Twitter. "Suddenly you see everything in 4k, it feels weirdly overwhelming but also amazing." Wieland shared some of her Midjourney v5 generations with Ars Technica (seen below in a gallery and in the main image above), and they certainly show a progression in image detail since Midjourney first arrived in March 2022.
How to choose a Sentence Transformer from Hugging Face
As a quick recap, Domain describes, at a high level, what the dataset is about. In addition to Domain, there are many Tasks used to produce vector embeddings. Unlike language models, most of which share the training task of predicting a masked-out token, embedding models are trained with a much broader set of objectives. For example, Duplicate Question Detection might perform better with a different model than one trained for Question Answering. A good rule of thumb is to pick a model that was trained within the same domain as your use case.
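For instance, a minimal sketch of that comparison with the `sentence-transformers` library might look like the following; the two checkpoint names are illustrative picks (one general-purpose, one trained on question-answer pairs), not recommendations tied to the examples above.

```python
# Minimal sketch: score the same query/document pairs with two checkpoints
# trained on different tasks, to see how the rankings differ.
from sentence_transformers import SentenceTransformer, util

models = {
    "general-purpose": SentenceTransformer("all-MiniLM-L6-v2"),
    "question-answering": SentenceTransformer("multi-qa-MiniLM-L6-cos-v1"),
}

query = "photorealistic image"
documents = [
    "An indicator for effectiveness of text-to-image guardrails",
    "Semantic Augmentation in Images using Language",
    "Google's AI-powered search tool can help tackle your holiday shopping",
]

for name, model in models.items():
    query_emb = model.encode(query, convert_to_tensor=True)
    doc_embs = model.encode(documents, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, doc_embs)[0]
    print(name, [round(float(s), 3) for s in scores])
```

In practice, the checkpoint whose training task and domain most closely match your retrieval scenario will usually produce the more useful ranking.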